Real-Time Player Detection From A Single Calibrated Camera

ABSTRACT

A method for detecting the location of objects from a calibrated camera involves receiving an image capturing an object on a surface from a first vantage point; generating an occupancy map corresponding to the surface; filtering the occupancy map using a spatially varying kernel specific to the object shape and the first vantage point, resulting in a filtered occupancy map; and estimating the ground location of the object based on the filtered occupancy map.

CROSS REFERENCE TO RELATED APPLICATIONS

This application is related to, and claims priority from, U.S. Provisional Patent Application No. 61/561,640, filed on Nov. 18, 2011 by G. Peter K. Carr, entitled “Real-Time Player Detection from a Single Calibrated Camera”, the contents of which are hereby incorporated by reference.

FIELD OF THE INVENTION

The exemplary embodiments relate to systems and methods that automatically estimate the positions of objects on a surface in a video, for example, the positions of players on a playing surface in a camera feed. The ECCV-2012 conference paper titled “Monocular Object Detection Using 3D Geometric Primitives”, by Peter K. Carr et al., is hereby incorporated by reference.

BACKGROUND INFORMATION

A prior automatic method of detection of persons in a video involves the use of ‘histograms of oriented gradients’ (“HOG”) as an effective means for detecting pedestrians within arbitrary still images. Although the method is quite effective at finding people, the approach is very computationally intensive and therefore slow. Moreover, the HOG pedestrian detector must be trained on a large set of manually labeled data. Publicly available implementations trained on images of pedestrians may not be able to handle the more complex poses of sports players and vantage points which were not included in the training database. Additionally, the HOG descriptor only uses the information from a single frame of video. If a continuous action sport is observed from a stationary camera, temporal information is also available.

Background subtraction is a second automatic method for detecting moving objects. In a background subtraction process, each pixel in the video frame is compared to its corresponding pixel in the previous frame or to its corresponding pixel in a reference image that models the background scene, possibly based on temporal history. Typically, the output of a background subtraction process is a binary background mask indicating whether the corresponding pixel in the video frame is a foreground (“1”) or a background (“0”) pixel. Alternatively, a background mask may indicate the probability of a pixel being a foreground pixel and hence assumes continuous values. Various background subtraction embodiments are described in U.S. patent application Ser. No. 12/403,857, incorporated by reference herein in its entirety. Note that other methods in the art may be used for foreground detection. For example, a background mask may be generated from depth information (either from a single structured light camera or via the disparity map from a stereo video pair). Similarly to estimating background appearance from the temporal history, modeling the geometry of the scene through temporal history of the depth information (or from a combination of the camera's parameters and a scene geometry model) is known in the art. Other modalities such as thermal cameras may be used as well for foreground detection.
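By way of illustration only, the comparison of each pixel against a reference image may be sketched as follows. This is a minimal sketch, not the method of the incorporated application: the function names `background_mask` and `update_reference`, the grayscale list-of-rows image layout, the threshold of 30 and the running-average rate `alpha` are all assumptions chosen for the example.

```python
def background_mask(frame, reference, threshold=30):
    """Return a binary mask: 1 (foreground) where a pixel differs from the
    reference (background) image by more than `threshold`, else 0.
    `frame` and `reference` are equally sized lists of rows of gray levels."""
    return [
        [1 if abs(f - r) > threshold else 0 for f, r in zip(frame_row, ref_row)]
        for frame_row, ref_row in zip(frame, reference)
    ]


def update_reference(reference, frame, alpha=0.05):
    """Hypothetical temporal background model: an exponential running average
    that slowly blends the current frame into the reference image."""
    return [
        [(1 - alpha) * r + alpha * f for f, r in zip(frame_row, ref_row)]
        for frame_row, ref_row in zip(frame, reference)
    ]


reference = [[10, 10], [10, 10]]
frame = [[10, 200], [12, 11]]
mask = background_mask(frame, reference)  # only the bright pixel is foreground
```

A continuous-valued mask, as mentioned above, could be obtained by returning a bounded function of the difference instead of thresholding it.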

Background subtraction is very efficient as an initial foreground detection step. Although background subtraction is fast, a straightforward implementation of it is also fairly naive and may produce incorrect results. Camera shake, for instance, causes many false foreground detections. This is especially true in high-definition video of outdoor sports, since there is strong contrast between grass and pitch markings. Camera shake, though, may be handled by compensating for vibration when comparing two consecutive frames or when comparing the current frame to a reference (background) image. However, this requires an additional step of image registration that is computationally involved. Another complexity is small appearance changes, caused by rain, snow, and shadows cast by players, for example. Shadows may be detected as foreground objects and are difficult to model. Another challenge is to discriminate between foreground and background regions with similar appearance: players' green uniforms may be mistaken for grass, and their torsos will not be detected as foreground, for example. A robust interpretation of background subtraction results is therefore important for reliable foreground detection.

Typically, a sensitivity threshold is used to determine whether a particular pixel is a foreground pixel. Hence, a pixel may be determined to be a foreground pixel based on its lack of similarity to the average of recent previous values or based on its lack of similarity to the corresponding pixel in a reference (background) image. Individual foreground pixels are then clustered into ‘blobs’ by finding the connected components in the binary image (background mask). Since the background mask may be noisy, a single connected component may not correspond to a complete object. Instead, a second clustering of ‘blobs’ into ‘objects’ is often performed. In this step, it is quite difficult to determine automatically (1) how many ‘objects’ exist, (2) which ‘blobs’ associate to which ‘objects’, and (3) which ‘blobs’ do not correspond to any ‘objects’ (in those cases where the ‘blobs’ were incorrectly identified as foreground regions during background subtraction).

The association of ‘blobs’ to ‘objects’ is often ambiguous. Therefore, this particular technique of extracting objects from blobs may be unreliable. For instance, the outstretched leg of one player could be considered as the arm of another player. Complex association heuristics may work in some situations, but they tend to fail catastrophically in other circumstances. To improve the robustness of the ‘blobs’ to ‘objects’ association, small ‘blobs’ are often removed in a preprocessing stage of image erosion followed by image dilation. However, this method could easily discard minute correct detections, such as a player on the far side of the field. Additionally, if many small ‘blobs’ are present in the image, the connected components algorithm (which groups pixels into ‘blobs’) may take an excessively long time to process. For instance, when it is raining, the image will be littered with many small false detections, or blobs may contain more than one player.

Once an object's foreground has been detected in the image space, its position in the scene is computed. The 3D ground position X of a player, for instance, is estimated by finding the 2D location x of their feet in the image and mapping this location to the ground plane using a homography H obtained from a full or partial camera calibration: x=HX. Knowledge of the homography, H, allows for one-to-one association of a 3D point in the scene, X, with its corresponding 2D point in image space, x. In order to uniquely associate a point in image space, x, with its corresponding point in the scene, X, the homography is expressed with respect to a specific plane in the 3D world. Specifically, the 3D location of a pixel at the player's head-top can be found based on the homography and knowledge of the player's height (i.e. the 2D plane crossing X=(x,y,h)). Similarly, the 3D location of a pixel at the player's foot can be found based on the homography and knowledge of the player's feet level (i.e. the 2D plane crossing X=(x,y,0)). Hence, in order to map a pixel in image space to its corresponding 3D point in the scene, knowledge of this point's height, for example, is necessary.
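The relation x=HX for a single plane, and its inversion to recover a ground position from an image point, may be illustrated as follows. This is a sketch under stated assumptions: the helper names `apply_homography` and `invert_3x3` are hypothetical, the 3×3 matrix is a made-up example rather than a real calibration, and points on the plane are written in homogeneous form (X, Y, 1).

```python
def apply_homography(H, point):
    """Map a 2D point through a 3x3 homography using homogeneous coordinates."""
    px, py = point
    u = H[0][0] * px + H[0][1] * py + H[0][2]
    v = H[1][0] * px + H[1][1] * py + H[1][2]
    w = H[2][0] * px + H[2][1] * py + H[2][2]
    return u / w, v / w


def invert_3x3(m):
    """Invert a 3x3 matrix via its adjugate (cofactor transpose)."""
    (a, b, c), (d, e, f), (g, h, i) = m
    det = a * (e * i - f * h) - b * (d * i - f * g) + c * (d * h - e * g)
    if det == 0:
        raise ValueError("homography is singular")
    return [
        [(e * i - f * h) / det, (c * h - b * i) / det, (b * f - c * e) / det],
        [(f * g - d * i) / det, (a * i - c * g) / det, (c * d - a * f) / det],
        [(d * h - e * g) / det, (b * g - a * h) / det, (a * e - b * d) / det],
    ]


# Illustrative ground-to-image homography for the z=0 plane (not a real camera).
H = [[2.0, 0.0, 10.0], [0.0, 2.0, 20.0], [0.0, 0.0, 1.0]]
feet_in_image = apply_homography(H, (3.0, 4.0))           # x = H X
feet_on_ground = apply_homography(invert_3x3(H), feet_in_image)
```

The inverse mapping is only meaningful for image points that actually lie on the chosen plane, which is the reason the height of the point must be known.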

A method to interpret background subtraction results on the ground plane (instead of the image plane) involves the realization that the homography which maps the object's foreground to the 3D scene, assuming all pixels are at ground level, is only valid for those pixels which are a projection of a part of the object that actually exists on the ground (such as feet and shadows). This mapping of objects' foreground from the image space to a plane in the 3D scene results in a ground map, named an “occupancy map”, that may be an aggregation of mapping results from several cameras positioned at different vantage points. Therefore, a player's feet will consistently map to the same ground position from multiple vantage points, but the upper body will map to different areas. As a result, determining the number of players and their positions in the world is equivalent to finding local peaks in the occupancy map produced by aggregating the mapping of foreground regions detected from multiple vantage points onto the ground plane. In addition to avoiding the heuristics of clustering ‘blobs’, this method also avoids the threshold stage of background subtraction, as no further analysis involves binary image processing. More details about the multiview occupancy map can be found in Khan, S. M. and Shah, M., “Tracking multiple occluding people by localizing on multiple scene planes”, Pattern Analysis and Machine Intelligence, IEEE Transactions on, vol. 31, no. 3, pp. 505-519, March 2009.

The insight that a player's feet will map to the same ground location in all camera views does not apply for just the z=0 plane. If the camera is fully calibrated, it is possible to map the image onto any plane. Following a similar logic, the top of a player's head should consistently map to the same (x, y) location on the z=h plane (assuming the player is h meters tall and standing upright). In fact, some part of the player's body will map to the same (x, y) location for any plane 0≤z≤h parallel to the ground. A height-specific occupancy map generated from multiple parallel planes produces local peaks which are much more dominant than the local peaks of an occupancy map produced from just the ground plane. Such occupancy maps have been generated using multiple cameras from different vantage points. In addition to a player's feet, the ground plane will also have consistent mappings between shadows. However, on horizontal planes above the ground, the shadows will not map to consistent locations. As a result, estimating the (x, y) location of the player by fitting a vertical axis through multiple planes of data is much more reliable than simply searching for the midpoint of the feet on the ground plane.

Mapping to multiple planes is more computationally expensive. In the continuous case, the integration over an infinite number of planes can be projected back into the image. Essentially, this corresponds to pre-computing bounding convex hulls in the image plane for every (x, y) location on the ground. Summing the number of foreground pixels within each convex hull is also an expensive computation. However, if the image is warped such that convex hulls become rectangles, the computation can be optimized using integral images. Although one can approximate the convex hulls as rectangles, a warp which sends the vertical vanishing point to infinity accomplishes the necessary rectification of the convex hulls, and is equivalent to tilting the camera so that it is level. However, if the camera is horizontal, the precision at which the (x, y) location can be estimated is greatly reduced. For high vantage points the image may be warped such that the optical axis of the camera is perpendicular to the ground. In this perspective, the ability to localize objects on the ground plane is optimal, but the ability to identify objects of height h is minimal. As a result, the integral image optimization requires many views to get accurate positions of objects which are h meters tall. For large outdoor playing areas, many cameras will be needed, and a significantly high vantage point for an approximate overhead view may not be possible. Additionally, since simultaneous access to the raw pixel data from every camera is needed to localize players, there is a high bandwidth requirement for real-time analysis, as all data must be analyzed at a single location.
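The integral image optimization mentioned above relies on the fact that, once a summed-area table has been built, the number of foreground pixels in any axis-aligned rectangle is obtained from four table look-ups. The following minimal sketch illustrates only that generic technique; the function names `integral_image` and `rect_sum` and the small binary mask are assumptions made for the example.

```python
def integral_image(mask):
    """Build a summed-area table with one extra row and column of zeros.
    Entry ii[y][x] holds the sum of mask rows 0..y-1 and columns 0..x-1."""
    rows, cols = len(mask), len(mask[0])
    ii = [[0] * (cols + 1) for _ in range(rows + 1)]
    for y in range(rows):
        row_sum = 0
        for x in range(cols):
            row_sum += mask[y][x]
            ii[y + 1][x + 1] = ii[y][x + 1] + row_sum
    return ii


def rect_sum(ii, top, left, bottom, right):
    """Sum of the mask over rows top..bottom-1 and columns left..right-1."""
    return ii[bottom][right] - ii[top][right] - ii[bottom][left] + ii[top][left]


mask = [
    [1, 0, 1],
    [1, 1, 0],
    [0, 1, 1],
]
ii = integral_image(mask)
whole = rect_sum(ii, 0, 0, 3, 3)         # all foreground pixels
lower_right = rect_sum(ii, 1, 1, 3, 3)   # the 2x2 block in the lower right
```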

The above approaches have exhibited numerous disadvantages. Those relying on a single-camera approach have been either slow (HOG) or unreliable (“blob” clustering). Other approaches have relied on the fusing of all the video data from multiple cameras simultaneously, in order to provide a central processing location with simultaneous access to all pixels from all cameras. Performing player detection on this basis requires significant bandwidth; in fact, the bandwidth of gigabit Ethernet limits these approaches to only two or three high definition (HD) cameras. A disadvantage with this fusion approach is that it does not scale well to a large number of cameras, since all the pixel data must be transmitted to a central location for processing. It would be advantageous, in order to ensure greater player detection accuracy and avoid false detections, if a non-fusion method for detecting players in real-time with minimal latency (for example, latency of one frame) could be scaled to a significantly larger number of cameras.

BRIEF DESCRIPTION OF THE DRAWINGS

FIG. 1A shows an exemplary embodiment in which a field hockey player 100 is illustrated being recorded by camera 105.

FIG. 1B shows a triangle response based on FIG. 1A.

FIG. 2 is a flow diagram illustrating an operation of an exemplary embodiment of the present application.

FIG. 3 illustrates a template as defined by a fixed number of spatial samples.

FIG. 4 illustrates a flow diagram for generating the template of FIG. 3.

FIG. 5 illustrates a set of normalized sampling locations.

FIG. 6 is a diagram illustrating the situation in which a player jumps a height j.

FIG. 7 shows a block diagram of a camera equipped to perform the object detection method described herein.

FIG. 8 shows a plurality of cameras deployed around a sports venue, such as a stadium.

FIG. 9 is an illustration of how objects of non-zero width will have multiple cross sections from the camera's vantage point.

FIG. 10 shows an expected trapezoidal pattern along a cross-section at a particular location.

DETAILED DESCRIPTION

The approach of at least one embodiment of the present invention may be described as an analysis of occupancy maps derived from calibrated monocular video. According to an embodiment, objects of a known size and shape are detected in real-time from a single vantage point. In contrast to the method discussed in the Background requiring the fusing of all video data from multiple cameras, the approach of an embodiment of the present invention involves the creation of a height-specific occupancy map from a single camera, and involves the individual processing of each camera-captured image and possibly the sending of only the detection results to a central location. The height-specific occupancy map estimates the probability of a location (x, y) on the ground plane being occupied by an object of a certain shape (based on a measure of foreground likelihood, provided by the background subtraction results, for instance).

The exemplary embodiments may be further understood with reference to the following description of the exemplary embodiments and the related appended drawings, wherein like elements are provided with the same reference numerals. The exemplary embodiments are related to systems and methods for detecting objects in a video image sequence. The exemplary embodiments are described in relation to the detection of players in a sporting event performing on a playing surface, but the present invention encompasses as well the tracking of pedestrians in crowd flow applications, such as would be used in parks and resorts, or the detecting of other objects such as vehicles. The method of the exemplary embodiments may be advantageously implemented using one or more computer programs executing on a computer system having a processor or central processing unit, such as, for example, a computer using an Intel-based CPU, such as a Pentium or Centrino, running an operating system such as the WINDOWS or LINUX operating systems, having a memory, such as, for example, a hard drive, RAM, ROM, a compact disc, magneto-optical storage device, and/or fixed or removable media, and having one or more user interface devices, such as, for example, computer terminals, personal computers, laptop computers, and/or handheld devices, with an input means, such as, for example, a keyboard, mouse, pointing device, and/or microphone. The computer system may include Graphics Processing Units (GPUs), be implemented using a Field-Programmable Gate Array (FPGA), or be part of an embedded system specifically designed for embodiments of this invention to meet real-time constraints.

The exemplary systems and methods may be applied to any type of object detection system that involves the detection of objects in the video image sequence. It is also noted that in the above description and the following description, the exemplary event sites are described as sporting events sites. However, the exemplary embodiments are not limited to such remote event sites. The detection method of the present embodiments may be described as having two stages. First, a height-specific occupancy map is generated (preferably at each frame) and, second, a spatially varying filter is applied to identify objects of a specific width and depth (involving the extraction of local peaks to determine XY locations on the ground). One example of a spatially varying filter function that can be used in the exemplary embodiments is a spatially varying kernel, which may be embodied as a two-dimensional matrix, but in general can be embodied as a one- or three-dimensional matrix as well. With regard to the objects to be identified, for example, in the case of detecting players, one embodiment of the method approximates them as cylinders 1.8 m tall and 0.5 m wide.

FIGS. 1A and 1B, in conjunction with the flow diagram of FIG. 2, are diagrams illustrating the principles of player detection using a single calibrated camera, according to an embodiment of the present invention. In FIG. 1A, a field hockey player 100 is illustrated being recorded by camera 105. (Step 200 in FIG. 2). The camera may be oriented, for example, to look down at an angle of 45°. The process employs a background subtraction method resulting in a foreground mask covering the foreground region R of the image plane. (Step 205). Although an embodiment of the present invention can use a binary foreground mask, the present invention may also use continuous measures, meaning pixels in the foreground mask assume bounded values (e.g. between 0 and 1) representing the probability of being part of a foreground region. Next, pixels from the foreground mask are warped onto multiple horizontal planes (e.g. the planes ab and cd, parallel to the ground) based on the given homography. (Step 210). Working in the geometry of the image plane is not ideal because the near side of the field is oversampled, and the far side is undersampled. Therefore, instead of warping the foreground region from the image space onto horizontal planes in the scene's 3D space, a backward mapping may be preferred. Hence, an embodiment of the present invention involves setting a grid of equally spaced points on the scene's ground (such as every 10 cm, for example) and projecting these points back into the image. The process then calculates the probability of whether the feet of a player exist at this ground plane location based on the value of the background mask at the projected point location.

Back to FIG. 1A, for convenience of explanation, a region R of the image plane is considered, which has been detected as foreground, and this region corresponds to a complete individual. On each horizontal plane between z=0 and z=h some portion of R will map to the true position of the player (illustrated by line segment be in FIG. 1A). The player's feet will map to the correct (x, y) position on the ground plane (z=0 plane), but his/her head will map to a very different location on the ground plane. Similarly, a player's head will map to the correct (x, y) location on the z=h plane, but his/her feet will appear at a location much closer to the camera on that plane. Generally, some part of the player's body will always map to the correct (x, y) location on a particular horizontal plane between z=0 and z=h (the rest of the body will map in front of and behind the true location). The volume of interest in the 3D world is divided into small volume elements. An occupancy value for the volume element at X=(x,y,z) is determined by projecting the 3D position X into the camera's image and interpolating the image of foreground likelihood at location x. Finally, by summing the occupancy scores of the vertical column of volume elements located at (x,y) (step 215), a height-specific occupancy map is generated. The map is then searched for a triangle response 110 (FIG. 1B) to be detected (step 220) along the camera/player cross-section, with the apex of the triangle corresponding to the true (x, y) location of the player, assuming the player is isolated in region R. The extent of the response (indicated by q and s in FIG. 1B) is determined by the distance r from the camera along the ground and the height Cz of the camera. If players are suitably close together along the line of sight of the camera (such as a distant player's feet being near the head of a player close to the camera), the triangle responses may overlap. In such a situation, the process may resolve this overlap by turning to a different camera that is recording the event from a viewpoint where the overlap is not present.

A graphics processing unit (GPU) residing in the camera can compute the height-specific spatial occupancy map for a resolution of 10 pixels/m, for example (other resolutions may be used). Every output pixel in the occupancy map is computed directly from the input background mask using the camera's projection matrix (homography). The GPU implementation of the method described in FIG. 2 is carried out as follows. To calculate the occupancy map value at a pixel corresponding to a certain ground location (x,y,0), a vertical segment that starts at (x,y,0) and extends to the exemplary position (x,y,h=1.8 m) is sampled at 5 samples/m (other ranges and sampling resolutions may be used, depending on the object to be detected). Next, each vertical sample, X_(i), is mapped to the corresponding location in image space using the camera's projection matrix P: x_(i)=PX_(i). Then the background mask is interpolated at these projected locations x_(i), resulting in a corresponding mask value for each sample. Finally, the average of all samples' mask values is computed. As a result, if the entire 1.8 m tall player is mapped onto the player's corresponding foreground region on the image plane, the (x,y) location in the spatial occupancy map will have a value of 1. Similarly, if all samples are mapped onto a background region, the spatial occupancy value will be 0.
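The per-location computation described above (sample the vertical segment, project each sample with P, interpolate the mask, and average) may be sketched on a CPU as follows. This is an illustrative sketch only, not the GPU implementation: the function names, the bilinear interpolation with zero outside the image, the inclusion of both segment endpoints as samples, and the toy affine projection matrix are all assumptions made for the example.

```python
def project(P, X):
    """Project a 3D point X=(x, y, z) with a 3x4 projection matrix P."""
    xh = (X[0], X[1], X[2], 1.0)
    u, v, w = (sum(p * c for p, c in zip(row, xh)) for row in P)
    return u / w, v / w


def bilinear(mask, u, v):
    """Bilinearly interpolate mask at column u, row v (0 outside the image)."""
    rows, cols = len(mask), len(mask[0])
    if not (0 <= u <= cols - 1 and 0 <= v <= rows - 1):
        return 0.0
    u0, v0 = int(u), int(v)
    u1, v1 = min(u0 + 1, cols - 1), min(v0 + 1, rows - 1)
    fu, fv = u - u0, v - v0
    top = (1 - fu) * mask[v0][u0] + fu * mask[v0][u1]
    bottom = (1 - fu) * mask[v1][u0] + fu * mask[v1][u1]
    return (1 - fv) * top + fv * bottom


def occupancy_value(P, mask, x, y, height=1.8, samples_per_m=5):
    """Average foreground likelihood along the vertical segment above (x, y)."""
    n = max(2, round(height * samples_per_m) + 1)  # endpoints included (assumed)
    total = 0.0
    for i in range(n):
        z = height * i / (n - 1)
        u, v = project(P, (x, y, z))
        total += bilinear(mask, u, v)
    return total / n


# Toy projection (not a calibrated camera): u = 10x, v = 10y + 10z.
P = [[10, 0, 0, 0], [0, 10, 10, 0], [0, 0, 0, 1]]
all_foreground = [[1.0] * 40 for _ in range(40)]
score = occupancy_value(P, all_foreground, x=1.0, y=1.0)  # close to 1
```

Evaluating `occupancy_value` over a regular ground grid (for example every 10 cm) would yield the full height-specific occupancy map.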

The number of players and their locations can be determined by searching the height-specific spatial occupancy map for expected cross-sectional patterns, such as triangles for the case of tall, thin cylinders (players). The exact nature of an expected pattern depends on the size and shape of the object, its 2D location on the ground plane, and the 3D position of the camera. In other words, the expected pattern is “spatially varying” because each discrete location in the 2D height-specific occupancy map will have a unique template defined by a specific set of cross-sectional patterns which indicate the presence of an object of particular size and shape at this specific (x,y) location. The “spatially varying” nature of the template makes efficient searching difficult. For a particular location X in the height-specific occupancy map generated for players, the triangular cross-sectional pattern will be oriented along the line connecting X and Cxy (the projection of the camera center onto the ground plane). The extent of the triangle in front of and behind X (with respect to the ground location Cxy of the camera) is, respectively, q=hr/Cz and s=hr/(Cz−h), where r is the distance between X and Cxy.
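As a worked illustration of the expressions q=hr/Cz and s=hr/(Cz−h), the following minimal sketch computes the orientation and extent of the expected triangle for one ground location. The function names and the numeric values (a camera 10 m high, a 1.8 m player) are assumptions chosen for the example.

```python
import math


def cross_section_direction(X, Cxy):
    """Unit vector on the ground plane pointing from X toward Cxy."""
    dx, dy = Cxy[0] - X[0], Cxy[1] - X[1]
    r = math.hypot(dx, dy)
    if r == 0:
        raise ValueError("X coincides with the camera's ground position")
    return dx / r, dy / r


def triangle_extents(h, r, cz):
    """Extent of the triangle response in front of (q) and behind (s) X."""
    if cz <= h:
        raise ValueError("camera must be higher than the object")
    q = h * r / cz        # toward the camera
    s = h * r / (cz - h)  # away from the camera
    return q, s


X, Cxy = (12.0, 16.0), (0.0, 0.0)
r = math.dist(X, Cxy)                       # 20 m along the ground
q, s = triangle_extents(h=1.8, r=r, cz=10.0)
```

With these numbers q is 3.6 m and s is roughly 4.39 m, so the response is longer behind the player than in front, and both extents grow with the distance r.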

FIG. 9 shows how objects of non-zero width will have multiple cross sections from the camera's vantage point. The expected trapezoidal pattern along a cross-section at a particular location is shown in FIG. 10. Therefore, the template of an object is the set of cross-sectional patterns along a number of cross-sections spanning the extent of the object.

In addition to the template being spatially varying, the height-specific spatial occupancy map contains errors propagated from background subtraction. Therefore, achieving well localized strong template matches will be difficult. Furthermore, not all players will be exactly 1.8 m tall. Shorter players will generate smaller template matching filter responses; taller players will produce larger, saturated responses. As a result, template matching is treated as a filtering process, which allows for the search of significant local peaks in the filtered response. The template matching performed by the exemplary embodiments searches for isolated individuals; if multiple vantage points from multiple deployed cameras are used in a crowded scene, the detection according to the exemplary embodiments assumes each individual will appear in isolation in at least one view.

According to this approach of the exemplary embodiments, in a situation where a plurality of cameras located at different vantage points are each performing the detection method taught herein in parallel and independent of each other, the pixel data from each camera may be processed in parallel, so that only low-bandwidth detection results are fused at a central location. On account of the lower bandwidth requirements, the exemplary embodiments permit a larger number of cameras to be used than the prior art approach. Each camera detects individuals independently, and the location data found by each camera is transmitted to a centralized location for fusion and subsequent processing.

The GPU that may be associated with the camera recording the video image containing region R computes a template matching score for each location in the height-specific spatial occupancy map by comparing the actual response to its theoretical triangle response. Template similarity can be evaluated at each pixel by sampling values from the surrounding local area. For efficiency, as seen in FIG. 3 and represented in the flow diagram of FIG. 4, the template is defined as a fixed number of spatial samples. (Step 400). The local coordinate system is defined by the direction n̂ of the cross-section between X and Cxy. The positive and negative half-planes (defined by n̂) are scaled so that both sides of the triangle response have positive or negative unit slope. To determine whether location X in the occupancy map coincides with the apex of an isolated triangle response, samples 305 are taken along the cross-section direction n̂ (which points from X to Cxy) (step 405) and along parallel offsets 310 to the cross-section (step 410). For convenience, the n component of the sampling locations is scaled/normalized by 1/q or 1/s so that the template has positive and negative unit slopes (step 415) (FIG. 5). The sampling locations in FIG. 5 are normalized so that the sampled values will match a triangle response with a unit slope. Turning back to FIG. 3, samples 315 beyond the range of the triangle response are also taken to ensure the triangle is isolated, i.e. padded by zero occupancy values before and after (step 420). Finally, note that spatial samples 310 offset by ±w m̂ are used to ensure the triangle response is isolated laterally. Both additional samples, 310 and 315, test for zero response and are used to ensure that the triangle response is isolated, by ensuring zero values around the perimeter. The scaling parameter w represents the minimum separation distance between player positions, which may be 0.5 m, for example.

The similarity score between the expected and actual responses is expressed as the sum of squared differences (SSD) for all samples (step 425). Alternatives such as sum of absolute differences (SAD) or normalized cross correlation (NCC) may also be used. The SSD values are scaled relative to the template similarity score of an unoccupied region. A perfect match will have a score of 0, while a match to unoccupied space will have a score of 1. The inverted normalized SSD scores are searched for significant local peaks (those above 0.5) (step 430). Significant local maxima (peaks) in the filtered occupancy map indicate the presence of isolated objects (players). However, because of noise (and possibly the sampling resolution), the peaks will be spread out and corrupted with noise. As a result, a robust peak finding algorithm, for example a mean shift algorithm, is used to determine the number of players and their precise (x,y) locations (step 435).
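The scoring described above may be illustrated with the following minimal sketch. It assumes one plausible reading of the normalization, namely that the SSD is divided by the SSD obtained when the same template is compared against an all-zero (unoccupied) response; the function names and the five-sample triangle template are hypothetical.

```python
def normalized_ssd(actual, expected):
    """SSD between sampled and expected responses, scaled so that an
    all-zero (unoccupied) response scores exactly 1 (assumed normalization)."""
    ssd = sum((a - e) ** 2 for a, e in zip(actual, expected))
    unoccupied = sum(e ** 2 for e in expected)
    if unoccupied == 0:
        raise ValueError("template must contain a non-zero sample")
    return ssd / unoccupied


def filtered_response(actual, expected):
    """Inverted normalized SSD: 1 for a perfect match, 0 for empty space."""
    return 1.0 - normalized_ssd(actual, expected)


# Expected samples: zero padding, unit-slope rise to the apex, fall, padding.
template = [0.0, 0.5, 1.0, 0.5, 0.0]
perfect = filtered_response(template, template)           # perfect match
empty = filtered_response([0.0] * 5, template)            # unoccupied space
noisy = filtered_response([0.1, 0.4, 0.9, 0.6, 0.0], template)
is_candidate = noisy > 0.5   # significance threshold quoted in the text
```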

The mean shift algorithm iteratively updates a set of “modes” which correspond to identified significant local peaks. At each iteration of the mean shift algorithm, data points are assigned to the closest mode, and the mode is updated using a weighted average of the assigned data points. The iterations continue until every data point continues to be associated with the same mode (no data point changes its mode association in successive iterations). As the algorithm proceeds, modes which are sufficiently close coalesce into a single mode. In this embodiment, the data points are generated by identifying all (x,y) locations in the filtered occupancy map which are above a particular value. The initial modes are generated by associating a single mode to each data point. At each iteration, each data point is assigned to its nearest mode by calculating the Mahalanobis distance, where the covariance of each data point is determined by the image to ground homography. In this situation, the mean shift parameters (of closeness for coalescing and covariance for Mahalanobis distance) relate directly to geometry, and a priori values can be used without needing to tune for each specific situation. The single view methodology of the exemplary embodiments avoids introducing parameters requiring tuning for each specific application. According to the exemplary embodiments, the parameters are based on the underlying geometry of the scene. Only the camera calibration parameters and extent of the playing surface need to be estimated or specified during employment. To achieve real time performance, the mean shift algorithm may be terminated after a finite number of iterations, even if some data points continue to alternate between assigned closest modes.
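The assign, update and coalesce loop described above may be sketched as follows. This is a simplified illustration rather than the embodiment itself: it substitutes plain Euclidean distance for the Mahalanobis distance, merges close modes by a simple pairwise average, and uses hypothetical names (`find_modes`, `merge_dist`, `max_iters`) and made-up data points.

```python
import math


def find_modes(points, weights, merge_dist=0.5, max_iters=20):
    """Simplified mode seeking: one initial mode per data point, then repeat
    assign-to-nearest-mode, weighted-mean update, and coalescing of modes
    closer than merge_dist, for a bounded number of iterations."""
    modes = list(points)
    for _ in range(max_iters):
        # Assign every data point to its nearest mode (Euclidean, assumed).
        nearest = [
            min(range(len(modes)), key=lambda k: math.dist(p, modes[k]))
            for p in points
        ]
        # Move each mode to the weighted average of its assigned points.
        updated = []
        for k in range(len(modes)):
            members = [i for i, a in enumerate(nearest) if a == k]
            if not members:
                continue  # a mode that attracted no points disappears
            total = sum(weights[i] for i in members)
            updated.append(tuple(
                sum(weights[i] * points[i][d] for i in members) / total
                for d in range(2)
            ))
        # Coalesce modes that are sufficiently close to each other.
        merged = []
        for m in updated:
            for j, other in enumerate(merged):
                if math.dist(m, other) < merge_dist:
                    merged[j] = tuple((a + b) / 2 for a, b in zip(m, other))
                    break
            else:
                merged.append(m)
        if merged == modes:  # nothing moved: assignments are stable
            break
        modes = merged
    return modes


# Locations above threshold in a filtered occupancy map (illustrative data).
pts = [(0.0, 0.0), (0.2, 0.0), (0.1, 0.1), (5.0, 5.0), (5.2, 5.1)]
players = find_modes(pts, weights=[1.0] * len(pts))
```

The `max_iters` bound mirrors the early termination mentioned above for real time operation.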

Instead of computing a unique template for each pixel, the spatial occupancy map can be re-sampled using polar co-ordinates. In this co-ordinate system, the template would no longer have varying orientation. However, size would still change in the vertical and horizontal directions. The polar transform would be particularly suitable for smaller venues. For larger areas, the angular resolution necessary to maintain the same spatial resolution (e.g., 10 pix/m) on the far side of the playing surface (e.g., approximately 50 m) may require a great deal of memory for polar coordinates. As an alternative to mean shift, a threshold may be applied to the filtered occupancy map to find areas which have significantly high occupancy scores; the resulting binary image should contain many small elliptical areas. Each ellipse represents the presence of an isolated object, and the center of the ellipse is a good estimate of the object's precise (x,y) location. Finally, different distributions of spatial samples could be used when evaluating how likely a pixel in the spatial occupancy map coincides with the apex of a triangle response.

FIG. 6 is a diagram illustrating the situation in which a player jumps a height j. As shown in FIG. 6, a player of height h standing at ground location x jumps j meters above the ground. The back-projection of the player's head and feet delineate the integration bounds between z=0 and z=h. In this situation, the cumulative response resembles a trapezoid, not a triangle. As a result, the algorithm will estimate the player's position x̂ as the mid-point of the top edge of the trapezoid. In this situation, the incorrect ground plane assumption creates two sources of error. First, the player's position has a bias of dx further from the camera. Second, the player appears to be dh taller. The magnitudes of dx and dh depend on the height Cz of the camera, the distance x along the ground between the player and the camera, the height j of the jump and the height h of the player.

$dh = \frac{jh}{2}\left( \frac{1}{C_{z} - j} \right)$

$dx = \frac{jx}{2}\left( \frac{1}{C_{z} - j - h} + \frac{1}{C_{z} - j} \right)$

Generally, a machine vision camera will be located quite high in the air, and a player's jump will be much smaller than his/her height, which implies

j<<h<<C_(z).

As a result, approximate expressions for dh and dx are

$dh \approx \frac{jh}{2C_{z}}$

$dx \approx \frac{jx}{C_{z}}.$
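The exact and approximate expressions for dh and dx can be compared numerically. The following is a minimal sketch that evaluates the formulas as written above; the function names and the sample values in the usage note are assumptions chosen for illustration.

```python
def jump_bias(j, h, x, cz):
    """Exact height error dh and position error dx for a jump of height j.

    j: jump height, h: player height, x: ground distance to the camera,
    cz: camera height (all in meters).
    """
    if cz - j - h <= 0:
        raise ValueError("camera must be higher than the jumping player's head")
    dh = (j * h / 2.0) * (1.0 / (cz - j))
    dx = (j * x / 2.0) * (1.0 / (cz - j - h) + 1.0 / (cz - j))
    return dh, dx


def jump_bias_approx(j, h, x, cz):
    """Approximate dh and dx, valid when j << h << cz."""
    return j * h / (2.0 * cz), j * x / cz
```

For instance, `jump_bias(0.5, 1.8, 20.0, 15.0)` and `jump_bias_approx(0.5, 1.8, 20.0, 15.0)` can be compared to see how far the approximation drifts when the camera is only moderately high; the two agree more closely as `cz` grows.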

One way to minimize the position error dx is to place the cameras as high as possible. Mathematically, as C_(z)→∞ the position error dx→0. For example, if the camera were placed on the ceiling looking directly downwards, the camera would not observe the vertical displacement of a player as he/she jumps. Another situation is when the camera is not high enough to neglect other effects. As long as the player is at a location x>(½)h (which should be typical in most set-ups), the change in position dx may be larger than the change in height dh. As a result, the occurrence of a jumping player will be easier to detect by testing for a non-zero dx value (the magnitude of dh will be insufficient to identify players that appear too tall).

From a single image, there may be no definitive way to test whether the detected position x̂=x+dx has a non-zero ∥dx∥ bias. If the system is tracking a series of detections { . . . , ẍ_(t-2), ẍ_(t-1), ẍ_(t)}, then it may be possible to test for discrepancies in smooth trajectories (especially since the bias will be along the line of sight). However, in practice, that is probably not feasible, as detection noise and tracking errors will probably dominate any ∥dx∥ offset from jumping.

Instead, a jump becomes much more noticeable when observations from two different vantage points are compared. When the player is on the ground, cameras 1 and 2 should observe consistent ground positions, i.e., x̂1=x̂2=x. However, when the player jumps into the air, the position errors dx1 and dx2 will be quite different (as long as the vantage points are significantly different). More importantly, the two errors will be correlated, since they both depend directly on the height of the jump j. The distance x along the ground between the player and the camera and the height of the camera C_(z) will be different for both vantage points (but roughly constant over the duration of the jump). As a result, the errors are correlated, but not equal.
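The two-camera argument can be illustrated with a small simulation. This is a sketch under stated assumptions: each camera's bias is modeled with the approximate expression dx ≈ jx/C_(z), applied along the line of sight on the ground plane, and the function names and camera layout are hypothetical. It is not a jump detector, only a demonstration that the disagreement between the two reported positions grows with the jump height.

```python
import math


def biased_detection(player_xy, camera_xy, camera_height, jump):
    """Ground position one camera would report for a player who jumped.

    The report is pushed away from the camera, along the line of sight,
    by approximately dx = jump * x / camera_height.
    """
    vx = player_xy[0] - camera_xy[0]
    vy = player_xy[1] - camera_xy[1]
    dist = math.hypot(vx, vy)  # ground distance x between camera and player
    if dist == 0.0:
        return player_xy  # camera directly overhead: no horizontal bias
    dx = jump * dist / camera_height
    scale = (dist + dx) / dist
    return (camera_xy[0] + vx * scale, camera_xy[1] + vy * scale)


def detection_disagreement(player_xy, camera_1, camera_2, jump):
    """Distance between the positions reported by two cameras.

    Each camera is a ((x, y), height) pair.
    """
    p1 = biased_detection(player_xy, camera_1[0], camera_1[1], jump)
    p2 = biased_detection(player_xy, camera_2[0], camera_2[1], jump)
    return math.hypot(p1[0] - p2[0], p1[1] - p2[1])
```

With the player on the ground (`jump=0`) the two reports coincide; for a non-zero jump the reports separate by an amount proportional to j, which is the correlation noted above.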

As mentioned above, the detection method of the exemplary embodiments is not limited to detecting players of a sport. The player detection embodiment models players as cylinders approximately 1.8 m tall and 0.5 m wide. For detecting other types of objects, such as vehicles, the detection method can be generalized to recognize constant height 3D geometric primitives, such as rectangular blocks and cylinders. Any cross-section through one of these primitives will generate a 2D rectangle (since the object has constant height). More importantly, the back projected camera rays which contain the rectangular cross-section will result in an arbitrary convex quadrilateral when integrated over a particular vertical range z=[0,H]. The previous triangular response corresponds to a particular instance of a convex quadrilateral. A given rectangular cross-section of width w and height h lying on the ground will generate two points p and s where the back projected rays from the camera intersect the ground plane (see FIG. 4). A second plane parallel to the ground but elevated by H will intersect the two rays at points q and r. As in the preceding discussion of jumping, similar triangles can be used to derive the relation between p and r:

$\frac{r}{C_{z} - H} = \frac{p + w}{C_{z} - h}$

$\frac{r}{p} = \left( 1 + \frac{w}{p} \right)\frac{C_{z} - H}{C_{z} - h}.$

The integrated response will be a:

- quadrilateral if r<p,
- triangle if r=p,
- trapezoid if r>p.

Generally, the integration limit H will be similar to the object height h. Furthermore, both w and p are greater than zero. As a result, the expression for r/p will be greater than 1 the majority of the time (indicating a trapezoid response). When w=0 and H=h, p is equal to r, which means a triangle response is produced. Any slight deviation of the actual object height h relative to the assumed height H will result in a non-triangle response, and the mean-shift algorithm should produce an answer at the center of the cross-sectional rectangle (which is equivalent to the midpoint of the top edge of the trapezoid when h=H).
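The three cases can be checked directly from the expression for r/p. The sketch below evaluates that ratio and names the resulting response shape; the function name `response_shape` and the tolerance used to test for equality are assumptions made for illustration.

```python
def response_shape(p, w, h, H, cz, tol=1e-9):
    """Classify the integrated response from the ratio r/p.

    p: ground intersection of the near ray, w: cross-section width,
    h: actual object height, H: assumed integration height, cz: camera height.
    Returns (shape, ratio) where ratio = r / p.
    """
    if p <= 0 or w < 0:
        raise ValueError("expected p > 0 and w >= 0")
    if not (0 <= h < cz and 0 <= H < cz):
        raise ValueError("expected heights h and H below the camera height")
    ratio = (1.0 + w / p) * (cz - H) / (cz - h)
    if abs(ratio - 1.0) <= tol:
        return "triangle", ratio
    if ratio > 1.0:
        return "trapezoid", ratio
    return "quadrilateral", ratio
```

For example, a cross-section of zero width with H equal to h yields a triangle, while any positive width with H equal to h yields a trapezoid, in line with the discussion above.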

In the case of vehicle detection, vehicles can be coarsely modeled as 3D blocks having particular width, height and depth. Unlike a cylinder, the width of the cross-sectional rectangle will change with the vehicle's position and orientation relative to the camera. However, since the camera is calibrated and vehicles often align with the direction of the road, one can infer the orientation of the 3D block for a given ground position. As a result, the expected cross-sectional width is known, which means the expected trapezoidal response template is also known. The height-specific spatial occupancy map is calculated in the same fashion, although using a different maximum height limit. The template filtering method would be conceptually the same, but the template is no longer a triangle response and the sampling pattern is different (and possibly the cost function, e.g., sum of absolute differences or normalized cross correlation). If the orientation of the vehicle is not known as a function of position, multiple filters may be applied to test for similarity with respect to a selection of specific orientations.
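Testing several orientation-specific filters at one location can be sketched as a simple template comparison. The example below assumes the observed response and each expected response template have already been sampled at the same set of points, and it uses the sum of absolute differences as the cost; the function names and the dictionary of templates keyed by orientation are hypothetical.

```python
def sum_abs_diff(observed, template):
    """Sum of absolute differences between two equally sampled responses."""
    if len(observed) != len(template):
        raise ValueError("observed and template must have the same length")
    return sum(abs(o - t) for o, t in zip(observed, template))


def best_orientation(observed, templates):
    """Pick the orientation whose template best matches the observation.

    `templates` maps an orientation label (e.g., degrees) to the expected
    response samples for that orientation. Lower cost means a better match.
    """
    scores = {angle: sum_abs_diff(observed, t) for angle, t in templates.items()}
    best = min(scores, key=scores.get)
    return best, scores[best]
```

Swapping `sum_abs_diff` for another cost, such as a sum of squared differences, would leave the selection logic unchanged.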

FIG. 7 shows a block diagram of a camera 700 equipped to perform the above-described object detection method. The camera 700 includes a conventional optics/video processing module 705 for producing a video signal depicting an object, a GPU 710 and an object detection module that together perform the above-described object-detection embodiments, and a memory 720 for storing such data structures as the occupancy map and detection results. The camera 700 is also provided with a transmitter for transmitting the low-bandwidth detection results wirelessly or over a wired network to a central location (e.g., a video truck outside of the sports venue or a remote studio) for further processing or storage.

FIG. 8 shows a plurality of cameras 700 deployed around a sports venue 800, such as a stadium. As explained above, each camera 700, in performing the embodiments described above, independently detects the players within its own vantage point and transmits its own low-bandwidth detection results to a central facility 805, where the detection results can undergo further processing, such as resolving conflicting or ambiguous player detections, or incorporating the results into a real-time or recorded program for distribution to viewers. Although FIG. 8 shows only four cameras, the present invention, because it eliminates the need for full image search across different scales and produces a low-bandwidth output containing detection results, allows the detection process to be scaled up to a large number of cameras. The practical limit on the number of cameras is a function of the data aggregation speed at a central location.
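The embodiments do not specify how the central facility resolves detections reported by several cameras. Purely as an illustration, one simple possibility is to merge ground detections that fall within a small radius of one another; the greedy averaging scheme, the function name `merge_detections`, and the radius parameter below are all assumptions and not part of the described method.

```python
import math


def merge_detections(per_camera, radius):
    """Greedily merge ground detections (x, y) reported by several cameras.

    A detection within `radius` meters of an existing cluster mean is
    averaged into that cluster; otherwise it starts a new cluster.
    Returns a list of (mean_x, mean_y, number_of_supporting_detections).
    """
    clusters = []  # each entry is [sum_x, sum_y, count]
    for detections in per_camera:
        for x, y in detections:
            for cluster in clusters:
                mean_x = cluster[0] / cluster[2]
                mean_y = cluster[1] / cluster[2]
                if math.hypot(x - mean_x, y - mean_y) <= radius:
                    cluster[0] += x
                    cluster[1] += y
                    cluster[2] += 1
                    break
            else:
                # No nearby cluster was found, so start a new one.
                clusters.append([x, y, 1])
    return [(sx / n, sy / n, n) for sx, sy, n in clusters]
```

Because each camera transmits only a short list of ground positions, a step of this kind operates on a few numbers per camera per frame rather than on image data.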

What is claimed is:
 1. A method for detecting the location of objects from a calibrated camera, comprising: receiving an image capturing an object on a surface from a first vantage point; generating an occupancy map corresponding to the surface; filtering the occupancy map using a spatially varying kernel specific to the object shape and the first vantage point, resulting in a filtered occupancy map; and estimating the ground location of the object based on the filtered occupancy map.
 2. The method of claim 1, wherein the object is a human, approximated by a cylinder.
 3. The method of claim 1, wherein the object is a vehicle modeled by a cuboid.
 4. The method of claim 1, wherein the occupancy map is computed based on a plurality of horizontal planes located between a first and a second plane, whereas the first plane corresponding to a height of zero and the second plane corresponding to a height of the object.
 5. The method of claim 1, wherein the occupancy map is generated from a background mask extracted from the received image.
 6. The method of claim 1, wherein the filtering includes applying a spatially varying filter to the occupancy map to identify each object matching a height criterion and a shape criterion.
 7. The method of claim 1, wherein the filtering includes calculating a template matching score for each location in the occupancy map by comparing a theoretical response to an actual response at each location.
 8. The method of claim 7, wherein the template matching score is determined by calculating a sum of squared differences.
 9. The method of claim 8, wherein the template matching scores for all locations in the occupancy map are searched for peaks meeting a predetermined criterion and corresponding to object locations.
 10. The method of claim 1, wherein the geometric response is one of a triangle response and a trapezoid response.
 11. A system for detecting the location of objects from a calibrated camera, comprising: an arrangement for receiving an image capturing an object on a surface from a first vantage point; an arrangement for generating an occupancy map corresponding to the surface; an arrangement for filtering the occupancy map using a spatially varying kernel specific to the object shape and the first vantage point, resulting in a filtered occupancy map; and an arrangement for estimating the ground location of the object based on the filtered occupancy map.
 12. The system of claim 11, wherein the object is a human, approximated by a cylinder.
 13. The system of claim 11, wherein the object is a vehicle modeled by a cuboid.
 14. The system of claim 11, wherein the occupancy map is computed based on a plurality of horizontal planes located between a first and a second plane, whereas the first plane corresponding to a height of zero and the second plane corresponding to a height of the object.
 15. The system of claim 11, wherein the occupancy map is generated from a background mask extracted from the received image.
 16. The system of claim 11, wherein the filtering includes applying a spatially varying filter to the occupancy map to identify each object matching a height criterion and a shape criterion.
 17. The system of claim 11, wherein the filtering includes calculating a template matching score for each location in the occupancy map by comparing a theoretical response to an actual response at each location.
 18. The system of claim 17, wherein the template matching score is determined by calculating a sum of squared differences.
 19. The system of claim 18, wherein the template matching scores for all locations in the occupancy map are searched for peaks meeting a predetermined criterion and corresponding to object locations.
 20. The system of claim 11, wherein the geometric response is one of a triangle response and a trapezoid response.
 21. A computer-readable medium containing instructions that when executed on a computing device results in a performance of the following: receiving an image capturing an object on a surface from a first vantage point; generating an occupancy map corresponding to the surface; filtering the occupancy map using a spatially varying kernel specific to the object shape and the first vantage point, resulting in a filtered occupancy map; and estimating the ground location of the object based on the filtered occupancy map.